convolutional vision transformer
ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages
--Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. T o address these limitations, we propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers. ECViT introduces inductive biases such as locality and translation invariance, inherent to Convolutional Neural Networks (CNNs) into the Transformer framework by extracting patches from low-level features and enhancing the encoder with convolutional operations. Additionally, it incorporates local-attention and a pyramid structure to enable efficient multi-scale feature extraction and representation. Experimental results demonstrate that ECViT achieves an optimal balance between performance and efficiency, outperforming state-of-the-art models on various image classification tasks while maintaining low computational and storage requirements. ECViT offers an ideal solution for applications that prioritize high efficiency without compromising performance. Transformers use self-attention [1] to model long-range dependencies, revolutionizing how models handle sequential data. The Vision Transformer (ViT) [2] treats images as sequences of patches and uses the self-attention to capture global dependencies, which has made a successful transition from natural language processing (NLP) to computer vision (CV).
A Convolutional Vision Transformer for Semantic Segmentation of Side-Scan Sonar Data
Rajani, Hayat, Gracias, Nuno, Garcia, Rafael
Distinguishing among different marine benthic habitat characteristics is of key importance in a wide set of seabed operations ranging from installations of oil rigs to laying networks of cables and monitoring the impact of humans on marine ecosystems. The Side-Scan Sonar (SSS) is a widely used imaging sensor in this regard. It produces high-resolution seafloor maps by logging the intensities of sound waves reflected back from the seafloor. In this work, we leverage these acoustic intensity maps to produce pixel-wise categorization of different seafloor types. We propose a novel architecture adapted from the Vision Transformer (ViT) in an encoder-decoder framework. Further, in doing so, the applicability of ViTs is evaluated on smaller datasets. To overcome the lack of CNN-like inductive biases, thereby making ViTs more conducive to applications in low data regimes, we propose a novel feature extraction module to replace the Multi-layer Perceptron (MLP) block within transformer layers and a novel module to extract multiscale patch embeddings. A lightweight decoder is also proposed to complement this design in order to further boost multiscale feature extraction. With the modified architecture, we achieve state-of-the-art results and also meet real-time computational requirements. We make our code available at ~\url{https://github.com/hayatrajani/s3seg-vit
Vision Transformers or Convolutional Neural Networks? Both!
The field of Computer Vision has for years been dominated by Convolutional Neural Networks (CNNs). Through the use of filters, these networks are able to generate simplified versions of the input image by creating feature maps that highlight the most relevant parts. These features are then used by a multi-layer perceptron to perform the desired classification. But recently this field has been incredibly revolutionized by the architecture of Vision Transformers (ViT), which through the mechanism of self-attention has proven to obtain excellent results on many tasks. If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material.
Vision Transformers or Convolutional Neural Networks? Both!
Through the use of filters, these networks are able to generate simplified versions of the input image by creating feature maps that highlight the most relevant parts. These features are then used by a multi-layer perceptron to perform the desired classification. But recently this field has been incredibly revolutionized by the architecture of Vision Transformers (ViT), which through the mechanism of self-attention has proven to obtain excellent results on many tasks. In this article some basic aspects of Vision Transformers will be taken for granted, if you want to go deeper into the subject I suggest you read my previous overview of the architecture. Although Transformers have proven to be excellent replacements for CNNs, there is an important constraint that makes their application rather challenging, the need for large datasets.